I  Introduction


As  early  as  in  1968  the  author  wrote  and  discussed  the  conceivable  non-monotonicity  of  the  operating 
characteristic  of  the  correct  answer  of  the  multiple-choice  test  item,  which  is  based  strictly  upon  theory 
(cf.  Samejima,  1968).  Since  then,  such  a  phenomenon  has  actually  been  observed  with  empirical  data. 
For  example,  Lord  and  Novick  reported  such  a  curve  when  they  plotted  the  percent  of  the  correct  answer 
against  the  test  score  for  each  item  as  an  approximation  to  the  item  characteristic  function  (cf.  Lord 
and  Novick,  1968,  Chapter  16).  Since,  as  their  Theorem  16.4.1  states,  the  average,  over  all  items,  of 
the  sample  item-test  regressions  falls  along  a  straight  line  through  the  origin  with  forty-five  degree  slope, 
such a dip cannot be detected for an easy item even if it exists, as long as we use the item-test regression
as an approximation. It is quite possible, therefore, that more than one of those
items have such “dips”, which simply were not detected.

In  recent  years,  several  more  direct  approaches  of  estimating  operating  characteristics  have  revealed 
such  “dips”  among  ASVAB  items.  Partly  because  of  the  availability  of  computer  software,  such  as 
Logist (Wingersky, Barton and Lord, 1982), Bilog (Bock and Aitkin, 1981), etc., however, it is a common
procedure among researchers to mold non-monotonic operating characteristics of correct answers
into the three-parameter logistic model, ignoring the non-monotonicity. In some cases, strategies
are even taken so that distractors which cause the non-monotonicity are considered undesirable and
are replaced by other, non-threatening alternative answers.

A  question  must  be  raised  as  to  whether  this  strategy  is  wise.  In  the  present  paper,  this  issue  will 
be  discussed  both  from  theory  and  from  practice,  and  a  new  strategy  of  writing  test  items,  which  leads 
to more efficient ability estimation, will be proposed. It will take advantage of the ease in handling
mathematics attributed to parameterization, and yet minimize the effect of noise caused by random
guessing.


II  Non-Monotonicity  of  the  Conditional  Probability  of  the 
Positive  Response,  Given  Latent  Variable 

This  section  is  basically  the  essence  or  a  summary  of  the  paper  published  by  the  author  more  than 
twenty  years  ago  (Samejima,  1968),  as  one  of  the  research  reports  of  the  L.  L.  Thurstone  Psychometric 
Laboratory  of  the  University  of  North  Carolina.  The  content  of  the  paper  was  a  protocol  which  led  to 
the proposal of a new family of models for the multiple-choice test item (Samejima, 1979b). The author
believes  that  this  paper  published  in  1968  still  gives  new  ideas  to  today’s  research  communities. 

The  paper  deals  with  the  nominal  response,  and  also  multiple-choice  situations,  in  which  examinees 
are  required  to  choose  one  of  the  given  alternatives,  in  connection  with  the  graded  response  model  (cf. 
Samejima, 1969, 1972). Let $\theta$ denote the latent variable, or ability, which assumes real numbers. Let
$g$ $(= 1, 2, \ldots, n)$ denote an item, $k_g$ be a discrete response to item $g$, and $P_{k_g}(\theta)$ denote the operating
characteristic of the discrete response $k_g$, or the conditional probability, given $\theta$, with which the
examinee responds to item $g$ with $k_g$. Throughout the paper the principle of local independence is
assumed to be valid, so that within any group of examinees all characterized by the same value of the
latent variable $\theta$ the distributions of the item response categories are all independent of each other.
Thus  the  operating  characteristic  of  a  given  response  pattern  is  a  product  of  the  operating  characteristics 
of  the  item  response  categories  contained  in  that  response  pattern  (cf.  Lord  and  Novick,  1968). 
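
As an illustration of the product rule implied by local independence, the following minimal sketch computes the operating characteristic of a binary response pattern at a fixed ability level. The normal ogive form, the two hypothetical items and their parameter values, and all function names are assumptions made here for illustration only.

```python
import math


def normal_ogive(theta, a, b):
    """Normal ogive operating characteristic, Phi(a * (theta - b))."""
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))


def pattern_probability(theta, items, pattern):
    """Operating characteristic of a binary response pattern.

    Under local independence it is the product, over items, of the
    operating characteristics of the observed item responses.
    """
    prob = 1.0
    for (a, b), u in zip(items, pattern):
        p = normal_ogive(theta, a, b)
        prob *= p if u == 1 else 1.0 - p
    return prob


# Hypothetical two-item test: (discrimination, difficulty) per item.
ITEMS = [(1.0, -0.5), (1.5, 0.5)]
PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]

if __name__ == "__main__":
    for pattern in PATTERNS:
        print(pattern, pattern_probability(0.3, ITEMS, pattern))
```

At any fixed ability level the four pattern probabilities sum to one, which is a quick sanity check on the product form.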

For  a  multiple-choice  item  a  certain  number  of  false  answers  are  given  in  addition  to  the  correct 
answer.  In  a  general  case  it  is  impossible  to  score  them  in  a  graded  manner  in  accordance  with  their 
degrees  of  attainment  toward  the  goal.  Thus  the  multiple-choice  situation  should  be  treated  as  a  special 
instance  of  the  nominal  level  of  response,  although,  in  addition,  the  problem  of  random  or  irrational 
choice should be investigated.

Confining discussions to examinees who have responded to item $g$ incorrectly, there can be a diversity
of false answers if they have responded to it freely, without being forced to choose one of a set of
alternative answers. It is conceivable that some of the false answers may require high levels of the ability
measured while some others may not, some may be related to the ability measured strongly while some
others may not, etc. An objective measure of the plausibility of a specified false answer is its operating
characteristic, i.e., the probability of its occurrence defined for a fixed value of ability $\theta$, and, therefore,
expressed as a function of $\theta$.

Let $M_s(\theta)$ be a sequence of the conditional probabilities corresponding to the cognitive subprocesses
required in finding the plausibility of response $k_g$ to item $g$, and $U_{k_g}(\theta)$ be the conditional probability
that an examinee discovers the irrationality of response $k_g$ as the answer to item $g$, on condition
that he has already found out its plausibility. The operating characteristic of $k_g$, which is denoted by
$P_{k_g}(\theta)$, can be expressed by

(2.1)  $P_{k_g}(\theta) = \prod_{s} M_s(\theta)\,[1 - U_{k_g}(\theta)]$ ,

since it is reasonably assumed that an examinee who gives a response $k_g$ to item $g$ is one who has
succeeded in finding $k_g$'s plausibility, and yet failed in finding its irrationality. We notice that this
formula is exactly the same in its structure as the definition of $P_{x_g}(\theta)$ on the graded response level,
where $M_s(\theta)$ is replaced by $M_u(\theta)$ and $U_{k_g}(\theta)$ is replaced by $M_{(x_g+1)}(\theta)$ (cf. Samejima, 1972).
Defining $M_{k_g}(\theta)$ such that

(2.2)  $M_{k_g}(\theta) = \prod_{s} M_s(\theta)$ ,

we can rewrite (2.1) into

(2.3)  $P_{k_g}(\theta) = M_{k_g}(\theta)\,[1 - U_{k_g}(\theta)]$ .


It will reasonably be assumed from their definitions that both $M_{k_g}(\theta)$ and $U_{k_g}(\theta)$ are strictly
increasing in $\theta$, provided that a specified response $k_g$ is a good mistake in the sense that the
discoveries of its plausibility and irrationality are properly related with ability $\theta$. It will also be
reasonably assumed that the upper asymptotes of $M_{k_g}(\theta)$ and $U_{k_g}(\theta)$ are unity, and the lower
asymptote of $M_{k_g}(\theta)$ is zero.

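We assume that both $M_{k_g}(\theta)$ and $U_{k_g}(\theta)$ are three-times-differentiable with respect to $\theta$. It is
easily observed that, in order to satisfy the unique maximum condition (Samejima, 1969, 1972), $P_{k_g}(\theta)$
defined by (2.3) must fulfill the following inequalities:

(2.4)  $\frac{\partial^2}{\partial\theta^2} \log M_{k_g}(\theta) = \frac{\partial}{\partial\theta}\left[\frac{\partial}{\partial\theta} M_{k_g}(\theta)\,\{M_{k_g}(\theta)\}^{-1}\right] < 0$

and

(2.5)  $\frac{\partial^2}{\partial\theta^2} \log\,[1 - U_{k_g}(\theta)] = \frac{\partial}{\partial\theta}\left[-\frac{\partial}{\partial\theta} U_{k_g}(\theta)\,\{1 - U_{k_g}(\theta)\}^{-1}\right] < 0$ .

(For proof, see Samejima, 1968.) Note that in this case the lower asymptote of $U_{k_g}(\theta)$ need not be zero.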
The operating characteristic of a specified response $k_g$ which satisfies the unique maximum condition
was called the plausibility curve (Samejima, 1968), and later the plausibility function (cf. Samejima,
1984a). As the condition suggests, the plausibility curve is necessarily unimodal. A schematized hypothesis
for the plausibility curve will be the following. The probability that an examinee will find the
plausibility, but will fail in discovering the irrationality, of a specified response $k_g$ as the answer to
item $g$ is a function of ability $\theta$; it increases as ability $\theta$ increases, reaches its maximum at a certain
value of $\theta$, and then decreases afterwards. If an item provides many such responses, their plausibility
curves will be powerful sources of information in estimating examinees’ abilities. That is to say, we can
make use of specific wrong answers to an item as sources of information, as well as the correct answer.
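
The unimodal shape described above can be checked numerically. The sketch below assumes, purely for illustration, that both the probability of finding the plausibility and the probability of discovering the irrationality follow normal ogives; the parameter values and the names `plausibility`, `M_PARAMS` and `U_PARAMS` are hypothetical and are not taken from the paper.

```python
import math


def normal_ogive(theta, a, b):
    """Normal ogive, Phi(a * (theta - b))."""
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))


def plausibility(theta, m_params, u_params):
    """Plausibility curve: finds the plausibility, fails to find the irrationality."""
    m = normal_ogive(theta, *m_params)   # probability of finding the plausibility
    u = normal_ogive(theta, *u_params)   # probability of discovering the irrationality
    return m * (1.0 - u)


# Hypothetical (discrimination, location) pairs for the two component curves.
M_PARAMS = (1.0, -1.0)
U_PARAMS = (1.0, 0.5)

GRID = [i / 100.0 for i in range(-500, 501)]
VALUES = [plausibility(t, M_PARAMS, U_PARAMS) for t in GRID]
MODE_INDEX = max(range(len(GRID)), key=VALUES.__getitem__)
MODE_THETA = GRID[MODE_INDEX]

if __name__ == "__main__":
    print("mode of the plausibility curve:", MODE_THETA)
```

With these assumed values the curve rises up to a single modal point and falls afterwards, so a specific wrong answer is most likely at intermediate ability and carries information about where an examinee lies.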

Let $P_g(\theta)$ denote the operating characteristic of the correct answer of a dichotomous item $g$ in
the free-response situation. Let $P_g^*(\theta)$ be the same function, but in the multiple-choice situation. The
conventional three-parameter model is represented by

(2.6)  $P_g^*(\theta) = c_g + (1 - c_g)\,P_g(\theta)$ ,

where $c_g$ is the probability with which an examinee will guess correctly (Lord and Novick, 1968).
This is a monotonically increasing function of $\theta$ with $c_g$ $(> 0)$ and unity as its lower and upper
asymptotes, provided that $P_g(\theta)$ is strictly increasing in $\theta$ with zero and unity as its lower and upper
asymptotes.

The psychological hypothesis which has led to the formula (2.6) in the multiple-choice situation is
the following. If an examinee has ability $\theta$, then the probability that he will know the correct answer
is given by $P_g(\theta)$; if he does not know it, he will guess randomly, and, with probability $c_g$, will guess
correctly (Lord and Novick, 1968). Thus we have for the operating characteristic of the correct answer
of item $g$ in the multiple-choice situation

(2.7)  $P_g(\theta) + [1 - P_g(\theta)]\,c_g$ ,

which leads to (2.6). This hypothesis may not necessarily be appropriate for ability measurement. One
can  never  tell  in  the  measurement  of  a  reasoning  ability,  for  instance,  whether  an  examinee  knows  the 
correct  answer  to  item  g  or  not,  until  he  has  tried  to  solve  it.  He  may  respond  with  an  incorrect 
alternative  without  guessing  at  all.  To  explain  such  a  case  we  need  some  other  hypothesis  than  the  one 
which  leads  to  the  formula  (2.6). 

Hereafter, we assume that $P_g(\theta)$ is strictly increasing in $\theta$ with zero and unity as its lower and
upper asymptotes, and is twice-differentiable with respect to $\theta$. Suppose, further, that both $P_g(\theta)$
and $[1 - P_g(\theta)]$ satisfy the unique maximum condition. In this case $P_g^*(\theta)$ defined by (2.6) does
not satisfy either of Conditions (i) and (ii) for the unique maximum, unless $c_g$ is zero, i.e., the free-response
situation, although they are fulfilled for the negative answer to item $g$ (cf. Samejima, 1973).
Observations and discussion are made (Samejima, 1968) giving two simple cases of the multiple-choice
situation as examples. In those examples, only two items are involved, the response pattern (1,0)
is solely treated, and precise mathematical derivations are given.

A possible correction for the conventional functional formula for the operating characteristic of the
correct answer of a multiple-choice item can be made by introducing the probability of random guessing
defined for a fixed value of $\theta$. Let $d_g(\theta)$ denote this probability. A reasonable assumption for this
function may be that it be non-increasing in $\theta$. Thus the probability with which an examinee of ability
$\theta$ will answer item $g$ correctly by following the due cognitive process is expressed by $[1 - d_g(\theta)]\,P_g(\theta)$;
and the one with which he will give the correct answer by guessing should be $d_g(\theta)\,c_g$. For economy of
notation, let $P_g^*(\theta)$ be the operating characteristic of the correct answer to item $g$ in the corrected
functional formula also. We can write

(2.8)  $P_g^*(\theta) = [1 - d_g(\theta)]\,P_g(\theta) + d_g(\theta)\,c_g = P_g(\theta) + d_g(\theta)\,[c_g - P_g(\theta)]$ .


A schematized psychological hypothesis which leads to this formula is as follows. If an examinee
has ability $\theta$, then he will depend upon random guessing in answering item $g$ with probability $d_g(\theta)$;
in that case, the conditional probability with which he will guess correctly is given by $c_g$. If he
does not depend upon random guessing, he will try to solve the item by the due cognitive process,
and will succeed in solving it with probability $P_g(\theta)$. Thus according to this functional formula the
probability with which an examinee will respond with an incorrect alternative without guessing is given
by $[1 - d_g(\theta)]\,[1 - P_g(\theta)]$, which is nil in the model represented by the formula (2.6).
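
The hypothesis above splits every response into four mutually exclusive outcomes, and the following minimal sketch makes that bookkeeping explicit. The normal ogive for the free-response curve, the logistic shape chosen for the guessing probability, and all names and numerical values are assumptions introduced here for illustration; the paper does not specify a functional form for the guessing probability.

```python
import math


def normal_ogive(theta, a=1.0, b=0.0):
    """Free-response operating characteristic of the correct answer (assumed form)."""
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))


def guessing_probability(theta, slope=2.0, location=-1.0):
    """Hypothetical non-increasing probability of resorting to random guessing."""
    return 1.0 / (1.0 + math.exp(slope * (theta - location)))


def outcome_probabilities(theta, c):
    """Four exhaustive outcomes implied by the corrected formula."""
    p = normal_ogive(theta)
    d = guessing_probability(theta)
    return {
        "correct_by_reasoning": (1.0 - d) * p,
        "incorrect_by_reasoning": (1.0 - d) * (1.0 - p),
        "correct_by_guessing": d * c,
        "incorrect_by_guessing": d * (1.0 - c),
    }


def p_star(theta, c):
    """Probability of the correct answer: reasoning success plus lucky guess."""
    outcomes = outcome_probabilities(theta, c)
    return outcomes["correct_by_reasoning"] + outcomes["correct_by_guessing"]


if __name__ == "__main__":
    for theta in (-2.0, 0.0, 2.0):
        print(theta, p_star(theta, 0.2), outcome_probabilities(theta, 0.2))
```

The `incorrect_by_reasoning` entry is the quantity that the conventional three-parameter formula sets to nil.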

We can conceive of several factors which may affect the functional formula for $d_g(\theta)$. The difficulty
of item $g$ may be one of them; the discriminating power may be another; the number of alternatives
attached to item $g$ may also affect the probability, i.e., it may be that the fewer the alternatives,
the more tempted an examinee will be to depend upon random guessing; also the plausibilities
of the alternatives may be counted as a factor.

In a simplified case where $d_g(\theta)$ is constant throughout the whole range of $\theta$, we can rewrite (2.8)
in the following form.

(2.9)  $P_g^*(\theta) = d_g c_g + [1 - d_g]\,P_g(\theta)$ .

This is somewhat similar to formula (2.6), the conventional functional formula for the operating characteristic
of the correct answer of a multiple-choice item. The lower asymptote of the present function
is $d_g c_g$ $(\le c_g)$, however, while it is $c_g$ in (2.6); the upper asymptote of the present function is
$[1 - d_g(1 - c_g)]$, which can be less than unity, while it is unity in (2.6). In a special case where $d_g = 0$,
that is, an examinee tries to solve item $g$ by proper reasoning with probability one, (2.9) reduces
to $P_g(\theta)$, the operating characteristic of the correct answer in the free-response situation. In another
special case where $d_g = 1$, that is, an examinee depends upon random guessing with probability one,
(2.9) reduces to a constant, $c_g$. In the more general case where $d_g(\theta)$ varies as $\theta$ varies, it is observed
from (2.8) that


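(2.10)  $\begin{cases} 0 < P_g(\theta) \le P_g^*(\theta) \le c_g & \text{if } \theta < \theta_0 \\ P_g^*(\theta) = c_g = P_g(\theta) & \text{if } \theta = \theta_0 \\ c_g \le P_g^*(\theta) \le P_g(\theta) < 1 & \text{if } \theta > \theta_0 \end{cases}$

where

(2.11)  $\theta_0 = P_g^{-1}(c_g)$ ,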

provided that $c_g$ is greater than zero. This result is quite natural, since it is reasonably assumed that
the probability of success in solving item $g$ will decrease by random guessing if the one attained by
the due cognitive process is higher than the one attained by random guessing, and it will increase by
random guessing if the latter probability is higher than the former. If we assume that the asymptotes
of $d_g(\theta)$ in negative and positive directions are unity and zero, respectively, we will obtain $c_g$ and
unity as the lower and upper asymptotes of $P_g^*(\theta)$. Figure 2-1 presents two examples of the operating
characteristic given by (2.8) where $c_g$ is 0.2, using two different $d_g(\theta)$'s. Note there is a “dip” on
the lower part of the curves for $P_g^*(\theta)$. These two $d_g(\theta)$'s are identical for the lower levels of $\theta$,
but differ on the upper levels, with asymptotes in the positive direction of 0.0 and 0.1, respectively. In these examples,
therefore, the upper asymptote of $P_g^*(\theta)$ is unity in the first example, and 0.92 in the second, i.e., in the second example the
conditional probability for the correct answer never approaches unity however high the ability may be.
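
The “dip” and the reduced upper asymptote can be reproduced with a small numerical sketch. The guessing-probability curves used below are hypothetical logistic shapes chosen only so that they tend to unity at low ability and to 0.0 or 0.1 at high ability; they are not the functions plotted in Figure 2-1, and the names `guessing_probability` and `limiting_upper_value` are assumptions of this sketch.

```python
import math


def normal_ogive(theta, a=1.0, b=0.0):
    """Free-response operating characteristic of the correct answer (assumed form)."""
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))


def guessing_probability(theta, high_ability_limit):
    """Hypothetical d(theta): tends to 1 at low ability, to the given limit at high ability."""
    return high_ability_limit + (1.0 - high_ability_limit) / (1.0 + math.exp(2.0 * (theta + 1.0)))


def p_star(theta, c, high_ability_limit):
    """Correct-answer probability P + d * (c - P) with ability-dependent guessing."""
    p = normal_ogive(theta)
    d = guessing_probability(theta, high_ability_limit)
    return p + d * (c - p)


def limiting_upper_value(c, high_ability_limit):
    """Upper asymptote 1 - d * (1 - c) when d tends to a constant at high ability."""
    return 1.0 - high_ability_limit * (1.0 - c)


C_G = 0.2
GRID = [i / 10.0 for i in range(-60, 61)]
CURVE_A = [p_star(t, C_G, 0.0) for t in GRID]   # d tends to 0.0 at high ability
CURVE_B = [p_star(t, C_G, 0.1) for t in GRID]   # d tends to 0.1 at high ability

if __name__ == "__main__":
    print("lowest point of curve A:", min(CURVE_A))
    print("upper asymptote of curve B:", limiting_upper_value(C_G, 0.1))
```

Both curves start near 0.2, fall below it, and then rise; the second curve levels off at 1 - 0.1(1 - 0.2) = 0.92, in agreement with the value quoted above.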

If $d_g(\theta)$ is differentiable, $P_g^*(\theta)$ is also differentiable, and from (2.8) we have

(2.12)  $\frac{\partial}{\partial\theta} P_g^*(\theta) = [1 - d_g(\theta)]\,\frac{\partial}{\partial\theta} P_g(\theta) + [c_g - P_g(\theta)]\,\frac{\partial}{\partial\theta} d_g(\theta)$ .

Thus it is obvious that $P_g^*(\theta)$ is strictly increasing in $\theta$ for the range $\theta > \theta_0$ if, and only if, $d_g(\theta)$
is less than unity for the range of $\theta$ satisfying $\theta > \theta_0$. Thus in this case $P_g^*(\theta)$ is non-decreasing in
$\theta$ throughout its whole range. In general, $P_g^*(\theta)$ equals $c_g$ and presents a horizontal line as long as
$d_g(\theta)$ is unity, and then increases for the rest of the range as $\theta$ increases.

As for the range expressed by $\theta < \theta_0$, $P_g^*(\theta)$ equals $c_g$ regardless of the value of $P_g(\theta)$ for the
values of $\theta$ for which $d_g(\theta)$ is unity, and is some positive value less than $c_g$ otherwise. If $d_g(\theta)$ is
unity throughout this range of $\theta$, $P_g^*(\theta)$ presents a horizontal line for this range. If $d_g(\theta)$ is unity
for the negative extreme value of $\theta$, but $d_g(\theta)$ takes on some values less than unity for a subset of
$\theta$ of this range, $P_g^*(\theta)$ has at least one local minimum. If $d_g(\theta)$ is less than unity for the negative
extreme value of $\theta$, $P_g^*(\theta)$ can be strictly increasing in $\theta$, non-decreasing, or have one or more local
minima, in accordance with the functional formula for $d_g(\theta)$.

It is obvious that any operating characteristic having local minima does not satisfy the unique
maximum condition (Samejima, 1969, 1972), and neither does the one whose first derivative equals zero
at some value of $\theta$. In the case of $P_g^*(\theta)$ defined by (2.8) we can prove that, in general, it does not
satisfy the unique maximum condition, even if it is strictly increasing in $\theta$. (For proof, see Samejima,
1968.)

Two characteristics of the model represented by (2.8) are that it allows “dips”, and also a smaller
value than unity for the upper asymptote of the operating characteristic of the correct answer, as
Figure 2-1 illustrates. In these examples, there is only one “dip” on the lower level of $\theta$. There can
be more than one, however, and an example is presented elsewhere (Samejima, 1968). In many cases
the  model  may  describe  the  real  operating  characteristic  of  the  correct  answer  more  closely  than  the 
three-parameter  model. 

It has been reported by several researchers that they have come across estimated operating characteristics
of correct answers that do not converge to unity, but to some other values less than unity.
Note that the general model described above can handle such situations, although most of the other
models proposed by different researchers so far cannot.

We  notice  that  neither  (2.2)  nor  (2.8)  explicitly  takes  into  consideration  the  influences  of  separate 
distractors.  Suppose  an  examinee  A  has  chosen  to  solve  item  g  by  reasoning,  i.e.,  without  guessing,  and 
has  reached  an  answer  which  is  not  correct.  Suppose,  further,  that  this  specified  response  is  not  given 
as  an  alternative  answer  to  this  item.  Then  either  he  will  decide  to  give  an  answer  by  guessing,  or  he 
will  try  to  solve  the  item  by  reasoning  all  over  again.  To  account  for  these  possibilities,  we  would  have 
to  give  practically  all  the  different  plausible  responses  to  item  g  as  its  alternatives,  which  is  practically 
impossible,  since  the  number  of  alternative  answers  is  more  or  less  restricted.  In  contrast  to  this,  it 
is  interesting  to  note  that  the  psychological  hypothesis  behind  the  three-parameter  logistic  model  may 
be  more  realistic  in  the  case  where  no  very  plausible  responses  except  for  the  correct  answer  to  item 
g  are  given  as  its  alternative  answers.  Thus,  even  if  an  examinee  has  reached  a  specified  plausible 
response  other  than  the  correct  answer,  he  may  turn  to  random  guessing  simply  because  he  cannot  find 
that  specified  answer  among  the  alternatives.  Such  a  situation  has  another  serious  problem,  however, 
since  it  is  likely  for  an  examinee  who  is  highly  alternative-oriented  to  choose  the  correct  answer  without 
much  reasoning  or  guessing,  simply  because  the  other  alternatives  are  too  ridiculous  to  be  the  answer 
to the item. As the result, the operating characteristic of the correct answer may be deformed so that
it has a lower difficulty and less discriminating power. Plausible answers as distractors are necessary as
alternatives in order not to destroy the nature of the item.

FIGURE 2-1

Relationships among $P_g(\theta)$, $d_g(\theta)$ and $P_g^*(\theta)$ Using Two Different $d_g(\theta)$'s.

It is conceivable that the plausibilities of the alternatives attached to item $g$ other than the correct
answer will be one of the factors affecting the probability of random guessing in the multiple-choice
situation. As distinct from the discussion developed in the preceding section, here we shall suppose that
an examinee will try to solve the item following proper cognitive processes at the beginning, and will guess only
in the case where he has reached an answer which is not given as an alternative, or where he has failed
to find any answer at all.

Let $k_g$ or $h_g$ denote a specified response to item $g$ which is given as an alternative, including the
correct answer, and $P_{k_g}(\theta)$ or $P_{h_g}(\theta)$ be its operating characteristic in the free-response situation.
It may reasonably be assumed that $\sum_{k_g} P_{k_g}(\theta)$ is less than or equal to unity for any fixed value of
$\theta$. Let $P_{k_g}^*(\theta)$ or $P_{h_g}^*(\theta)$ denote the operating characteristic of a specified alternative $k_g$ or $h_g$ in
the multiple-choice situation, and $c_{k_g}$ or $c_{h_g}$ be the probability of choosing $k_g$ or $h_g$ by guessing,
which satisfies

(2.13)  $\sum_{k_g} c_{k_g} = 1$ .

Thus we can write

(2.14)  $P_{k_g}^*(\theta) = P_{k_g}(\theta) + c_{k_g}\left[1 - \sum_{h_g} P_{h_g}(\theta)\right]$

for any $k_g$, and, by using the notation for the correct answer as we did in the previous sections, we
obtain

(2.15)  $P_g^*(\theta) = P_g(\theta) + c_g\left[1 - \sum_{h_g} P_{h_g}(\theta)\right]$ .

It  is  worth  noting  that  we  have  specified  not  only  the  operating  characteristic  of  the  correct  answer 
in  the  multiple-choice  situation,  but  also  of  each  distractor.  The  utility  of  the  operating  characteristic 
of  each  wrong  alternative  answer  in  the  estimation  of  an  examinee’s  ability,  as  well  as  the  one  of  the 
correct  response,  is  suggested,  and  this  is  a  feature  of  the  present  discussion. 

It has been made clear that, in general, $P_g^*(\theta)$ does not satisfy the unique maximum condition
regardless of the functional formulae for the plausibility curves of the distractors. As for the alternatives
other than the correct answer, it can easily be shown that, in general, $P_{k_g}^*(\theta)$ does not satisfy the
unique maximum condition (cf. Samejima, 1968, 1979b).

For the purpose of illustration, Figure 2-2 presents a simple example in which only two alternatives,
the correct answer and one incorrect response, are given. In this example, $P_{k_g}^*(\theta)$ for the wrong answer
is drawn from the ceiling in order to make the picture visibly understandable. A normal ogive function
given by

(2.16)  $P_g(\theta) = (2\pi)^{-1/2} \int_{-\infty}^{a_g(\theta - b_g)} \exp\{-u^2/2\}\,du$ ,

where $a_g = 1/1.48$ and $b_g = 0.36$, is used as the operating characteristic of the correct answer, and
the same formula is applied for $U_{k_g}(\theta)$ and $M_{k_g}(\theta)$ for the incorrect response. The corresponding
values of parameters are 1/1.23 and -1.84 for $M_{k_g}(\theta)$, and 1/1.51 and -0.83 for $U_{k_g}(\theta)$. The value of
$c_g$, as well as that of $c_{k_g}$ for the incorrect answer, is 0.5.
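
The two-alternative example can be sketched numerically as follows. The sketch assumes that each multiple-choice curve equals the free-response curve plus the guessing weight times the probability of arriving at no listed alternative, which is the reading of formulas (2.14) and (2.15) adopted here from the surrounding text. It also reads the difficulty of the correct answer as 0.36. The parameter values are those quoted for Figure 2-2; the function names are hypothetical.

```python
import math


def normal_ogive(theta, a, b):
    """Normal ogive, Phi(a * (theta - b))."""
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))


# Parameter values quoted for the Figure 2-2 example.
A_G, B_G = 1.0 / 1.48, 0.36     # correct answer
A_M, B_M = 1.0 / 1.23, -1.84    # finding the plausibility of the wrong answer
A_U, B_U = 1.0 / 1.51, -0.83    # discovering its irrationality
C_G = C_K = 0.5                 # guessing weights, summing to one


def free_response(theta):
    """Free-response curves of the correct answer and of the wrong answer."""
    p_correct = normal_ogive(theta, A_G, B_G)
    p_wrong = normal_ogive(theta, A_M, B_M) * (1.0 - normal_ogive(theta, A_U, B_U))
    return p_correct, p_wrong


def multiple_choice(theta):
    """Multiple-choice curves under the assumed reading of (2.14) and (2.15)."""
    p_correct, p_wrong = free_response(theta)
    residual = 1.0 - (p_correct + p_wrong)   # reached no listed alternative, so guesses
    return p_correct + C_G * residual, p_wrong + C_K * residual


if __name__ == "__main__":
    for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(theta, multiple_choice(theta))
```

Because the guessing weights sum to one, the two multiple-choice probabilities sum to one at every ability level, which is presumably why the wrong-answer curve can be drawn from the ceiling of the same picture.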

It is obvious from the above observations and discussion that these are the fundamental philosophies
which  led  to  the  proposal  of  the  “new”  family  of  models  for  the  multiple-choice  test  item  (Samejima, 
1979b).  These  philosophies  will  provide  us  with  the  idea  of  content-based  observation  of  informative 
distractors  and  strategies  of  writing  test  items,  which  will  be  proposed  in  a  later  section.  The  general 
model  described  here  is  called  Informative  Distractor  Model,  in  contrast  with  the  Equivalent  Distractor 
Model,  to  which  the  three-parameter  model  represented  by  (2.6)  belongs  (cf.  Samejima,  1979b). 


III  Effect of Noise in the Three-Parameter Logistic Model
and  the  Meanings  of  the  Difficulty  and  Discrimination 
Parameters 

The three-parameter logistic model for the multiple-choice test item is represented by

(3.1)  $P_g^*(\theta) = c_g + (1 - c_g)\,[1 + \exp\{-D a_g(\theta - b_g)\}]^{-1}$ ,

where $a_g$, $b_g$, and $c_g$ are the item discrimination, difficulty, and guessing parameters, and $D$ is a
scaling factor, which is set equal to 1.7 when the logistic model is used as a substitute for the normal
ogive model. When $c_g = 0$, this formula provides us with the operating characteristic of the correct
answer in the original logistic model.

It is still a common procedure among researchers to adopt the three-parameter logistic model for
their multiple-choice test items and compare the resulting estimated discrimination parameters, or the
difficulty parameters, across different items. An important fact that is overlooked is that this is not
legitimate, for the addition of the third parameter $c_g$ makes the other two item parameters lose their
original meanings. If $a_g = 1.00$ and $c_g = 0.25$ in the three-parameter logistic model, for example, this
corresponds to $a_g = 0.75$ in the logistic model in terms of the maximum discrimination power. If, in addition
to these parameter values, $b_g = 0.00$, then the difficulty level for the three-parameter logistic model,
defined as the level of $\theta$ at which chances for success are 0.5, is -0.4077336, i.e., substantially lower
than 0.00.

In general, we can write

(3.2)  $\alpha_g = (1 - c_g)\,a_g$ ,
       $\beta_g = b_g + (D a_g)^{-1} \log\,(1 - 2c_g)$ ,

where $\alpha_g$ denotes the actual discrimination power and $\beta_g$ is the actual difficulty level in the three-parameter
logistic model. As we can see in (3.2), the effect of the third parameter $c_g$ can be substantial,
both on the discrimination power $\alpha_g$ and on the difficulty index $\beta_g$. Thus the simple comparison of the
values of $a_g$ for two or more test items having different values of the lower asymptote $c_g$ is illegitimate
and can be harmful, for the factor $(1 - c_g)$ may affect the value of $\alpha_g$, the real discrimination power,
substantially. As for the difficulty index, since the second term on the right hand side of the second
equation of (3.2) is always negative for $0 < c_g < 0.5$, this term represents the amount of decrement
of the difficulty level. Note that as $c_g$ tends to 0.5, $\beta_g$ approaches negative infinity! (If $c_g \ge 0.5$
then $\beta_g$ does not even exist.) The illegitimacy of, and the danger in, comparing $b_g$'s across two or
more test items having different lower asymptotes $c_g$ is even more obvious for the difficulty index.
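
The two relations in (3.2) can be verified numerically. The following minimal sketch computes the actual discrimination power and the actual difficulty level from a set of three-parameter logistic values and checks that the curve passes through 0.5 at the computed difficulty; the function names are hypothetical, and D = 1.7 is the scaling factor stated in the text.

```python
import math

D = 1.7  # scaling factor used in the text


def three_pl(theta, a, b, c):
    """Three-parameter logistic operating characteristic of the correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def actual_discrimination(a, c):
    """Maximum discrimination power relative to the logistic model with c = 0."""
    return (1.0 - c) * a


def actual_difficulty(a, b, c):
    """Ability level at which the probability of success equals 0.5."""
    if not 0.0 <= c < 0.5:
        raise ValueError("the 0.5 level is never reached when c >= 0.5")
    return b + math.log(1.0 - 2.0 * c) / (D * a)


if __name__ == "__main__":
    a, b, c = 1.0, 0.0, 0.25
    beta = actual_difficulty(a, b, c)
    print("actual discrimination:", actual_discrimination(a, c))
    print("actual difficulty:", beta)
    print("probability at that level:", three_pl(beta, a, b, c))
```

For the example in the text, with discrimination 1.00, difficulty 0.00 and guessing parameter 0.25, the sketch gives an actual discrimination of 0.75 and an actual difficulty that agrees with -0.4077336 to the digits quoted.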


FIGURE 3-1

Operating Characteristics in the Normal Ogive Model (Dotted Line), in the Logistic Model
(Solid Line) and in the Three-Parameter Logistic Model (Dashed Line), with the
Parameters $a_g = 1.0$, $b_g = 0.0$, $c_g = 0.25$ and the Scaling Factor $D = 1.7$. (Horizontal axis: THETA.)

Figure 3-1 presents the operating characteristic of the correct answer in the normal ogive model with
$a_g = 1.00$ and $b_g = 0.00$ by a dotted line, the one in the logistic model with the same parameters and
the scaling factor, $D = 1.7$, by a solid line, and the one in the three-parameter logistic model with
the same two item parameters and scaling factor and the third parameter, $c_g = 0.25$, by a dashed
line. It is obvious from theory that for all the three operating characteristics of the correct answer the
derivatives are highest at $\theta = b_g = 0.0$. Actually, these three derivatives are $(2\pi)^{-1/2} a_g$, $D a_g/4$
and $(1 - c_g) D a_g/4$, respectively, for the three functions in Figure 3-1. The ratio of this maximal slope
in the normal ogive model to the one in the logistic model is approximately 0.938687718, which is not
so much less than unity. The corresponding ratio between the three-parameter logistic model and the
logistic model is $(1 - c_g)$, which equals 0.75 when $c_g = 0.25$, and is as low as 0.50 when $c_g = 0.50$.
The ratio between the three-parameter logistic model and the normal ogive model is approximately
$(1 - c_g)/0.938687718$, that is, about $1.0653\,(1 - c_g)$, which is a little more than $(1 - c_g)$.
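
The maximal slopes quoted above follow from the closed forms of the three curves, and the short sketch below evaluates them. The function names are hypothetical; the formulas are the three derivatives given in the text, with D = 1.7.

```python
import math

D = 1.7  # scaling factor used in the text


def max_slope_normal_ogive(a):
    """Slope of the normal ogive at theta = b: a / sqrt(2 * pi)."""
    return a / math.sqrt(2.0 * math.pi)


def max_slope_logistic(a):
    """Slope of the logistic curve at theta = b: D * a / 4."""
    return D * a / 4.0


def max_slope_three_pl(a, c):
    """Slope of the three-parameter logistic curve at theta = b: (1 - c) * D * a / 4."""
    return (1.0 - c) * D * a / 4.0


if __name__ == "__main__":
    a, c = 1.0, 0.25
    ogive = max_slope_normal_ogive(a)
    logistic = max_slope_logistic(a)
    three_pl = max_slope_three_pl(a, c)
    print("normal ogive / logistic:", ogive / logistic)
    print("three-parameter / logistic:", three_pl / logistic)
    print("three-parameter / normal ogive:", three_pl / ogive)
```

The first ratio reproduces the value 0.938687718 and the second reproduces the factor 0.75 for a guessing parameter of 0.25.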

Figure  3-2  illustrates  that  several  sets  of  substantially  different  parameter  values  in  the  three- 
parameter  logistic  model  can  produce  very  similar  operating  characteristics  of  the  correct  answer.  We 
can  tell  that  the  differences  in  the  values  of  the  discrimination  and  difficulty  parameters  for  these  items 
are substantial, and yet the resulting curves are very close to each other for a wide range of $\theta$. Simple
comparison of the estimated discrimination parameters is illegitimate, therefore, when the estimated
guessing parameters prove to be different from each other, as is usually the case with actual data. Since
the estimation of the third parameter $c_g$ tends to be most inaccurate, this example indicates the danger
in direct comparisons of the estimated discrimination parameters, and also the estimated difficulty
parameters, across the items.

In  most  cases  the  estimated  guessing  parameter  of  a  multiple-choice  test  item  provides  us  with  some 
other  value  than  the  reciprocal  of  the  number  of  the  alternative  answers.  It  is  reported  that  in  some  cases 
the  estimated  cg  takes  on  quite  high  values  (cf.  Lord,  1980,  Section  2.2).  These  phenomena  suggest 
that  the  philosophy  behind  the  model  is  unrealistic.  Researchers  using  the  three-parameter  logistic 
model  argue,  however,  that  it  still  is  a  convenient  approximation  to  real  operating  characteristics  of 
correct  answers,  because  of  its  simplicity  in  mathematics.  In  a  way  it  is  true.  The  effective  use  of 
the  three-parameter  model  cannot  be  realized,  however,  unless  we  know  the  problems  attributed  to 
the  model,  and  use  the  model  in  such  a  way  that  these  weaknesses  will  not  cause  too  much  noise  and 
inefficiency. 

An investigation of the problems encountered when we apply the three-parameter logistic model to
data which actually follow the normal ogive model was made earlier (Samejima, 1984b). The data used
in the study are simulated data for two samples of 500 and 2,000 hypothetical examinees, respectively,
sampled from the uniform ability distribution on the interval of θ , (-2.5, 2.5). In order to investigate
the effect of the number of test items on the estimated parameters obtained by Logist 5, we
used: 1) a Ten Item Test and 2) a Thirty-Five Item Test, both of which consist of binary items following
the normal ogive model. The response pattern for each hypothetical subject was produced by the Monte
Carlo method. Combining these two hypothetical tests, we observed the result of: 3) a Forty-Five Item
Test, and, in addition, we observed the result of a rather artificially created: 4) Eighty Item Test (cf.
Samejima, 1984b).
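
A data set of this kind could be generated along the following lines. This is only a sketch of the general Monte Carlo idea: the item parameters, the seed, and the function names are hypothetical and are not those of the original study.

```python
import math
import random


def normal_ogive(theta, a, b):
    """P(correct | theta) = Phi(a * (theta - b)) for a binary item."""
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))


def simulate(n_examinees, items, seed=0):
    """Draw abilities from U(-2.5, 2.5) and one binary response per item."""
    rng = random.Random(seed)
    thetas, patterns = [], []
    for _ in range(n_examinees):
        theta = rng.uniform(-2.5, 2.5)
        pattern = [
            1 if rng.random() < normal_ogive(theta, a, b) else 0
            for a, b in items
        ]
        thetas.append(theta)
        patterns.append(pattern)
    return thetas, patterns


# Hypothetical ten-item test: (discrimination, difficulty) pairs.
items = [(1.0 + 0.1 * (g % 5), -2.0 + 4.0 * g / 9.0) for g in range(10)]

thetas, patterns = simulate(500, items)
print(patterns[0], round(thetas[0], 3))
```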

These  results  suggest  that  there  exists  a  substantial  effect  of  the  assumed  third  parameter,  cg  ,  on  the 
other  two  estimated  item  parameters,  if  the  estimation  is  made  by  molding  the  operating  characteristic 
of  the  correct  answer  into  that  of  the  three-parameter  logistic  model,  when  actually  it  follows  the  normal 
ogive  model.  This  effect  appears  to  be  stronger  on  the  estimated  discrimination  parameter  than  on  the 
estimated difficulty parameter. In order to adjust for these effects, the discrimination shrinkage
factor and the difficulty reduction index were proposed (Samejima, 1984b), as given by formulae (3.4)
and (3.6), respectively.

(3.3)    [left-hand side illegible in the source] = $\xi(c_g^*)$ .

FIGURE 3-2

Examples of the Operating Characteristics of the Correct Answer in the Three-Parameter Logistic
Model (Dotted Lines), Together with the One in the Logistic Model with ag = 1.00 and
bg = -0.64 (Solid Line). The Parameters for the Four Functions in the Order of ag , bg
and cg are: 1.05, -0.52, 0.10; 1.10, -0.40, 0.20; 1.15, -0.27, 0.30; 1.20, -0.13, 0.40;
respectively.

(3.4)    $\xi(c_g^*) = -\log(1 - 2c_g^*)\,[\log(1 + c_g^*) - \log(1 - c_g^*)]^{-1}$ .

(3.5)    $b_g^* = b_g + \zeta(c_g^* \mid a_g)$ .

(3.6)    $\zeta(c_g^* \mid a_g) = (D a_g)^{-1}\,[\log(1 + c_g^*) - \log(1 - c_g^*)]$ .

In these formulae, $a_g^*$ , $b_g^*$ , and $c_g^*$ indicate the estimated item discrimination, difficulty and guessing
parameters, respectively, when the three-parameter logistic model is assumed. Some of the resulting estimated
operating characteristics of the correct answer turned out to be disastrously different from the
theoretical functions, especially when only ten binary test items were included. We found no substantial
differences between the results of the 500 Subject Case and the 2,000 Subject Case, indicating that increasing
the number of subjects from 500 to 2,000 does not provide us with a substantial gain.
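
The two correction quantities can be evaluated as follows. This sketch rests on a reading of formulae (3.4) and (3.6) from a partly illegible source: the names `xi` and `zeta`, the grouping of the logarithmic terms, and the scaling factor D = 1.7 are all assumptions. Under this reading the first function is only defined for an estimated guessing parameter below 0.5.

```python
import math

D = 1.7  # assumed scaling factor


def xi(c_star):
    """Assumed reading of (3.4): -log(1 - 2c*) / [log(1 + c*) - log(1 - c*)]."""
    if not 0.0 < c_star < 0.5:
        raise ValueError("this reading of (3.4) requires 0 < c* < 0.5")
    return -math.log(1.0 - 2.0 * c_star) / (
        math.log(1.0 + c_star) - math.log(1.0 - c_star)
    )


def zeta(c_star, a):
    """Assumed reading of (3.6): [log(1 + c*) - log(1 - c*)] / (D * a)."""
    if not 0.0 < c_star < 1.0:
        raise ValueError("c* must lie strictly between 0 and 1")
    return (math.log(1.0 + c_star) - math.log(1.0 - c_star)) / (D * a)


for c_star in (0.10, 0.20, 0.25):
    print(c_star, round(xi(c_star), 4), round(zeta(c_star, 1.0), 4))
```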

It  has  been  pointed  out  that  the  three-parameter  logistic  model  does  not  satisfy  the  unique  maximum 
condition  for  the  likelihood  function,  and  this  topic  has  been  thoroughly  discussed  (Samejima,  1973). 
The expected loss of item information for a fixed value of θ is given by

(3.7)    $I_g(\theta) - I_g^*(\theta) = c_g D^2 a_g^2\,[\psi_g(\theta)\{1 - \psi_g(\theta)\}]\,[c_g + (1 - c_g)\psi_g(\theta)]^{-1}$ ,

where

(3.8)    $\psi_g(\theta) = [1 + \exp\{-D a_g(\theta - b_g)\}]^{-1}$ ,


and $I_g(\theta)$ and $I_g^*(\theta)$ are the item information functions in the logistic and the three-parameter logistic
models, respectively. For the critical value $\underline{\theta}_g$ , below which the information provided by the
correct answer to the item following the three-parameter logistic model assumes negative values, we have

(3.9)    $\underline{\theta}_g = b_g + (2Da_g)^{-1} \log c_g$ ,

which is strictly increasing in the parameter value cg , and also in ag and in bg .
If, for example, ag = 1.00 and bg = 0.00 , then $\underline{\theta}_g$ = -0.473364 for cg = 0.20 , and $\underline{\theta}_g$ = -0.407734 for
cg = 0.25 . These are considerably high values relative to bg .
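
The two quoted critical values can be reproduced, and the sign change of the information provided by the correct answer can be checked, with a short computation. This is a sketch assuming D = 1.7; the response information is approximated by a central finite difference of the log of the three-parameter logistic curve, and the function names are hypothetical.

```python
import math

D = 1.7  # assumed scaling factor


def p_3pl(theta, a, b, c):
    """Three-parameter logistic operating characteristic of the correct answer."""
    psi = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (1.0 - c) * psi


def critical_theta(a, b, c):
    """Critical value of formula (3.9): b + log(c) / (2 * D * a)."""
    return b + math.log(c) / (2.0 * D * a)


def response_information_correct(theta, a, b, c, h=1e-3):
    """-d^2/dtheta^2 log P(theta), by a central finite difference."""
    def log_p(t):
        return math.log(p_3pl(t, a, b, c))

    return -(log_p(theta + h) - 2.0 * log_p(theta) + log_p(theta - h)) / (h * h)


crit_20 = critical_theta(1.0, 0.0, 0.20)
crit_25 = critical_theta(1.0, 0.0, 0.25)
print(round(crit_20, 6), round(crit_25, 6))

# The information from a correct answer is negative just below the
# critical value and positive just above it.
print(response_information_correct(crit_20 - 0.1, 1.0, 0.0, 0.20))
print(response_information_correct(crit_20 + 0.1, 1.0, 0.0, 0.20))
```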

An important implication is that $\underline{\theta}_g$ is the value of θ below which the existence of a unique
maximum likelihood estimate is not assured for all the response patterns which include the correct
answer to item g . Although this warning has been ignored by most researchers for many years, a
recent study (Yen, Burket and Sykes, in press) points out that this happens much more often than
people might think.

It  has  been  pointed  out  (Samejima,  1979a,  1982a)  that  there  is  a  certain  constancy  in  the  total 
amount  of  item  information,  regardless  of  the  parameter  values  and  of  specific  functional  formulae  for 
the operating characteristic of the correct answer. If, for example, the model belongs to Type A, i.e.,
the operating characteristic of the correct answer is monotone increasing with zero and unity as its
lower and upper asymptotes, respectively, then the total area under the curve of the square root of the
item information function will equal π . If the model belongs to Type B, i.e., the same as Type A
except that the lower asymptote of the operating characteristic of the correct answer is greater than
zero, as is the case with the three-parameter logistic model, then the total area will become


(3.10)    $\pi - 2\tan^{-1}[\{c_g(1 - c_g)^{-1}\}^{1/2}]$ ,


with the second term as the loss in the total amount of item information. This term
is strictly a function of cg . When cg = 0.20 , for example, the total amount of item information
reduces, approximately, to 0.705π , and when cg = 0.25 it is approximately equal to 0.667π .
More  observations  concerning  the  effect  of  noise  in  the  three-parameter  logistic  model  have  been  made 
elsewhere  (Samejima,  1982b). 
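
The constancy of the total area, and the reduction caused by a non-zero lower asymptote, can be verified by numerical integration. The sketch below assumes D = 1.7 and uses a plain trapezoidal rule over a finite interval wide enough for the tails to be negligible; the function names, the interval, and the step size are choices made for illustration.

```python
import math

D = 1.7  # assumed scaling factor


def sqrt_item_information(theta, a, b, c):
    """Square root of the 3PL item information (c = 0 gives the logistic model)."""
    x = D * a * (theta - b)
    psi = 1.0 / (1.0 + math.exp(-x))
    one_minus_psi = 1.0 / (1.0 + math.exp(x))  # computed directly for stability
    return D * a * psi * math.sqrt(
        (1.0 - c) * one_minus_psi / (c + (1.0 - c) * psi)
    )


def total_area(a, b, c, lo=-20.0, hi=20.0, n=40000):
    """Trapezoidal approximation of the area under sqrt(item information)."""
    h = (hi - lo) / n
    total = 0.5 * (
        sqrt_item_information(lo, a, b, c) + sqrt_item_information(hi, a, b, c)
    )
    for i in range(1, n):
        total += sqrt_item_information(lo + i * h, a, b, c)
    return total * h


def closed_form_area(c):
    """Formula (3.10): pi - 2 * arctan(sqrt(c / (1 - c)))."""
    return math.pi - 2.0 * math.atan(math.sqrt(c / (1.0 - c)))


for c in (0.0, 0.20, 0.25):
    print(c, round(total_area(1.0, 0.0, c) / math.pi, 4),
          round(closed_form_area(c) / math.pi, 4))
```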

As  all  the  above  observations  indicate,  the  addition  of  the  third  parameter,  cg  ,  to  the  logistic  model 
creates  many  negative  results.  We  have  seen  that  these  negative  effects  are  greater  for  larger  values  of 
cg  .  In  using  the  three-parameter  logistic  model  as  an  approximation  to  real  operating  characteristics, 
therefore,  we  need  to  take  these  facts  into  consideration.  Among  others,  if  we  are  in  a  situation  where 
we  can  modify  or  revise  our  items,  we  must  try  to  reduce  the  effect  of  noise  coming  from  cg  as  much 
as  possible.  Strategies  of  writing  the  multiple-choice  test  items  must  be  considered  accordingly. 


IV  Informative  Distractors  of  the  Multiple-Choice  Test  Item 

So far most observations and discussion have been focused on theory. Applications of certain
nonparametric methods of estimating the operating characteristics for some empirical data have revealed,
however,  that  many  multiple-choice  test  items  do  not  follow  the  three-parameter  model,  nor  do  they 
follow  the  Equivalent  Distractor  Model  in  general,  to  which  the  three-parameter  logistic  model  belongs. 
Those  items  can  best  be  interpreted  by  the  Informative  Distractor  Model. 

Figure  4-1  presents  two  examples  of  the  set  of  operating  characteristics  of  the  four  alternative 
answers  to  an  item  taken  from  the  Level  11  Vocabulary  Test  of  the  Iowa  Test  of  Basic  Skills,  which  were 
estimated  by  the  Simple  Sum  Procedure  of  the  Conditional  P.D.F.  Approach  combined  with  the  Normal 
Approach  Method  (Samejima,  1981).  We  can  see  in  these  graphs  that  each  distractor  has  its  own  unique 
operating  characteristic,  or  plausibility  function,  and  also  the  estimated  operating  characteristic  of  the 
correct  answer  is  fairly  close  to  the  one  in  the  normal  ogive  model,  which  is  drawn  by  a  solid  line 
in  the  figure.  This  set  of  operating  characteristics  can  better  be  represented  by  one  of  the  family  of 
models proposed for the multiple-choice test item, which originated from the philosophy described
in  Section  2  and  takes  account  of  the  unique  information  provided  by  each  distractor  as  well  as  the 
effect  of  the  examinee’s  random  guessing  behavior  (cf.  Samejima,  1979b).  Figure  4-2  illustrates  the 
operating  characteristic  of  the  correct  answer  in  Model  A.  We  can  see  that  it  is  very  close  to  the  one 
in  the  normal  ogive  model  which  is  drawn  by  a  dotted  line,  except  for  the  lower  part  of  the  curve,  the 
conditional  probability  of  success  which  is  almost  entirely  caused  by  random  guessing.  In  cases  like  this, 
it  will  be  wise  to  approximate  the  curve  by  the  normal  ogive  function  by  discarding  the  item  response 
in  estimating  lower  ability,  since  it  provides  us  with  nothing  but  noise,  as  was  discussed  in  the  preceding 
section. 

Detailed  observations  for  the  plausibility  functions  of  distractors  are  made  elsewhere  (Samejima, 
1984a),  for  the  forty-three  items  of  the  Level  11  Vocabulary  Subtest  of  the  Iowa  Test  of  Basic  Skills. 
Similar discoveries have also been reported with respect to many ASVAB test items. In those results,
it is clear that the separate wrong answers given as alternatives provide us with differential information.
As long as we adopt models like the three-parameter logistic model, however, we will never discover,
nor can we make use of, this differential information, which can be useful in ability estimation in the
sense that it will substantially increase the accuracy of estimation.

FIGURE 4-1

Two Examples of the Estimated Operating Characteristics of the Correct Answer (Dotted Line)
and of the Three Distractors (Dashed Lines) Obtained by the Simple Sum Procedure of the
Conditional P.D.F. Approach Combined with the Normal Approach Method Together with
the One for the Correct Answer Obtained by Assuming the Normal Ogive Model (Solid
Line) Taken from the Level 11 Vocabulary Subtest of the Iowa Test of Basic Skills.
(Vertical axis: probability.)

FIGURE 4-2

Example of the Operating Characteristic of the Correct Answer in Model A (Solid Line)
Together with the One in the Normal Ogive Model (Dotted Line). (Vertical axis: probability.)

How can we approach the plausibility functions of distractors? One way of handling this issue
may  be  to  adopt  a  model  that  belongs  to  the  family  of  models  (Samejima,  1979b)  described  earlier, 
and  estimate  the  parameters  involved  in  the  model.  Another  more  scientific  approach  may  be  to  use 
a  nonparametric  method  of  estimating  the  operating  characteristic.  In  this  way,  we  shall  be  able  to 
discover  the  plausibility  functions  of  our  distractors,  rather  than  mold  them  into  some  mathematical 
formula.  This  will  be  discussed  in  the  subsequent  section,  proposing  a  new  approach  to  item  analysis. 
Practical  suggestions  in  writing  items,  which  take  all  these  facts  and  observations  into  account,  will  be 
given  in  Section  6. 


V  Merits of the Nonparametric Approach for the Identification
of Informative Distractors and for the Estimation of
the Operating Characteristics of an Item

Methods and approaches developed for estimating the operating characteristics of discrete item
responses without assuming any mathematical form (cf. Samejima, 1981, 1990) enable us to find out
whether or not a given incorrect alternative answer to a multiple-choice test item is informative in the
sense that it contributes to an increase in the accuracy of the estimation of the individual’s ability.
In past years various sets of data based upon the Vocabulary Subtest of the Iowa Tests of Basic
Skills, upon Shiba’s Word/Phrase Comprehension Tests, upon the ASVAB Tests of Word Knowledge and of
Math Knowledge, etc., have been analyzed by using, mainly, the Simple Sum Procedure of the Conditional
P.D.F. Approach combined with the Normal Approach Method (cf. Samejima, 1981). Recently, the author
proposed a new approach, which is called the Differential Weight Procedure of the Conditional P.D.F.
Approach. This new method has been introduced and discussed in a separate paper (Samejima, 1990).
Here we shall only introduce the essence of the method, and illustrate some results on simulated data.

The item response information function, $I_{k_g}(\theta)$ , is defined by

(5.1)    $I_{k_g}(\theta) = -\frac{\partial^2}{\partial\theta^2} \log P_{k_g}(\theta)$ .

The item information function $I_g(\theta)$ is the conditional expectation of the item response information
function, given θ , so that we can write

(5.2)    $I_g(\theta) = E[I_{k_g}(\theta) \mid \theta] = \sum_{k_g} I_{k_g}(\theta) P_{k_g}(\theta)$ .

It can be shown that the test information function, $I(\theta)$ , which is defined as the conditional expectation,
given θ , of the response pattern information function (Samejima, 1981), equals the sum total of the
n item information functions, i.e.,

(5.3)    $I(\theta) = \sum_{g=1}^{n} I_g(\theta)$ .

FIGURE 5-1

Examples of the Estimated Operating Characteristic of the Correct Answer (Dotted Line)
in Comparison with the True Operating Characteristic (Solid Line). The Differential Weight
Procedure of the Conditional P.D.F. Approach Based upon the Simple Sum Procedure
Combined with the Normal Approach Method was Used for the Estimation. Also
Presented is the Operating Characteristic in the Three-Parameter Logistic
Model Fitted to the True Operating Characteristic by Dr. Michael Levine
(Dashed Line). (Axes: probability against theta.)

Let τ be a strictly increasing transformation of θ , for which we can write

(5.4)    $\tau = \tau(\theta) = C_1^{-1} \int_{-\infty}^{\theta} [I(t)]^{1/2}\,dt + C_0$ ,

where $C_0$ is an arbitrary constant for adjusting the origin of τ , and $C_1$ is an arbitrary positive
value equal to the constant amount of test information for the range of τ of interest, respectively. In
the Simple Sum Procedure of the Conditional P.D.F. Approach, the estimated operating characteristic
$\hat{P}_{k_h}(\theta)$ of the discrete response $k_h$ to an unknown item h is obtained by

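(5.5)    $\hat{P}_{k_h}(\theta) = \hat{P}_{k_h}[\tau(\theta)] = \sum_{s \in k_h} \hat{\phi}(\tau \mid \hat{\tau}_s)\,[\sum_{s=1}^{N} \hat{\phi}(\tau \mid \hat{\tau}_s)]^{-1}$ ,

where s is an individual, N is the number of individuals, and $\hat{\phi}(\tau \mid \hat{\tau}_s)$ is the estimated conditional
density of τ , given the maximum likelihood estimate $\hat{\tau}_s$ of τ for the individual s . In the Normal
Approach Method, $\hat{\phi}(\tau \mid \hat{\tau}_s)$ is obtained by fitting the normal density function to $\phi(\tau \mid \hat{\tau}_s)$ , using the
estimated first and second conditional moments of τ , given $\hat{\tau}_s$ , as its two parameters. We have

(5.6)    $E(\tau \mid \hat{\tau}) = \hat{\tau} + C_1^{-2} \frac{d}{d\hat{\tau}} \log \hat{g}(\hat{\tau})$ ,

and

(5.7)    $\mathrm{Var.}(\tau \mid \hat{\tau}) = C_1^{-2}\,[1 + C_1^{-2} \frac{d^2}{d\hat{\tau}^2} \log \hat{g}(\hat{\tau})]$ ,

where $\hat{g}(\hat{\tau})$ is the estimated density function of $\hat{\tau}$ , which can be approximated by a polynomial fitted
by the method of moments (Samejima and Livingston, 1979).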
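
The following sketch illustrates the ratio-of-sums idea behind the Simple Sum Procedure combined with the Normal Approach Method. It is not the author's program: for brevity the density of the estimates is taken to be a normal density fitted to the sample mean and variance (an assumption replacing the polynomial fitted by the method of moments), the data are simulated with an assumed normal error of variance 1/C1² around the true τ , and all names and parameter values are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def conditional_moments(tau_hat, c1, mu, s2):
    """First two conditional moments of tau given tau_hat, as in (5.6) and (5.7).

    The density of tau_hat is ASSUMED normal, N(mu, s2), so that the
    derivatives of its logarithm are available in closed form.
    """
    sigma2 = c1 ** -2                 # error variance of tau_hat given tau
    d_log_g = -(tau_hat - mu) / s2    # first derivative of log g
    d2_log_g = -1.0 / s2              # second derivative of log g
    mean = tau_hat + sigma2 * d_log_g
    var = sigma2 * (1.0 + sigma2 * d2_log_g)
    return mean, var


def simple_sum_estimate(tau, tau_hats, responses, category, c1):
    """Sum of densities over examinees giving `category`, over the sum for all."""
    n = len(tau_hats)
    mu = sum(tau_hats) / n
    s2 = sum((t - mu) ** 2 for t in tau_hats) / n
    dens = [normal_pdf(tau, *conditional_moments(t, c1, mu, s2)) for t in tau_hats]
    numerator = sum(d for d, r in zip(dens, responses) if r == category)
    return numerator / sum(dens)


# Simulated illustration with a hypothetical logistic item on the tau scale.
rng = random.Random(1)
c1 = 2.0
taus = [rng.uniform(-2.5, 2.5) for _ in range(2000)]
tau_hats = [t + rng.gauss(0.0, 1.0 / c1) for t in taus]
responses = [1 if rng.random() < 1.0 / (1.0 + math.exp(-1.7 * t)) else 0 for t in taus]

est_low = simple_sum_estimate(-1.5, tau_hats, responses, 1, c1)
est_high = simple_sum_estimate(1.5, tau_hats, responses, 1, c1)
print(round(est_low, 3), round(est_high, 3))
```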

In the Differential Weight Procedure of the Conditional P.D.F. Approach, we can write for the
estimated operating characteristic

(5.8)    $\hat{P}_{k_h}(\theta) = \hat{P}_{k_h}[\tau(\theta)]$ = [right-hand side illegible in the source] ,

where $W_{k_h}(\tau)$ is a differential weight function. Since this function involves $P_{k_h}(\tau)$ (cf. Samejima,
1990), which itself is the target of estimation, we may use its estimate obtained by the Simple
Sum Procedure combined with the Normal Approach Method as its initial approximation, with some
modifications.

Figure 5-1 presents eight examples of the estimated operating characteristic of the correct answer of
the dichotomous test item obtained by using the Differential Weight Procedure. They are based upon
simulated data provided by Dr. Charles Davis of the Office of Naval Research. The true operating
characteristic of the correct answer for each item is available, therefore, and it is also drawn in each
graph. Also presented in each graph is the operating characteristic in the three-parameter logistic
model fitted to the true operating characteristic by Dr. Michael Levine. As we can see in this figure,
the Differential Weight Procedure of the Conditional P.D.F. Approach will provide us with fairly good
estimates of operating characteristics, if we choose suitable differential weight functions.

An important implication of these results is that the nonparametric approach for estimating the
operating characteristic has succeeded in approximating non-monotonic functions. This is essential in
using any method for the estimation of plausibility functions. Although we need more research to
improve the fit further, these results are promising for success in identifying informative
distractors and in estimating their operating characteristics.

Item analysis has a long history, starting from the classical proportion correct and item-test
regression. In the context of latent trait models, the operating characteristics and the information functions
have provided us with powerful tools. Now we can add the plausibility functions of the distractors to
this category. By accurately identifying the configuration of the operating characteristics of the correct
answer and the distractors, we shall be able to understand the characteristics of the item, its strengths
and weaknesses. In this way modifications of the item can be made if necessary. Successful
nonparametric methods of estimating the operating characteristics are essential, therefore, for this new, more
informative approach to item analysis.


VI  Efficiency  in  Ability  Estimation  and  Strategies  of  Writing 
Test  Items 

The observations and discussion made in the preceding sections give us much useful information as
well as warnings. First of all, theoretical observations indicate that non-monotonicity of the operating
characteristic  of  the  correct  answer  to  the  multiple-choice  test  item  is  a  natural  consequence  of  theory. 
Secondly,  it  has  been  shown  from  several  different  angles  that  the  third  parameter,  cg  ,  in  the  three- 
parameter  model  provides  us  with  nothing  but  noise;  the  greater  the  value  of  cg  the  more  noise 
and  inaccuracies  in  estimation  it  produces.  Thirdly,  it  has  been  pointed  out  that,  although  it  is  still 
a  common  procedure  for  researchers  to  mold  the  operating  characteristics  of  the  correct  answers  of 
their  multiple-choice  test  items  into  the  three-parameter  logistic  model,  some  nonparametric  methods 
applied  to  empirical  data  have  revealed  the  non-monotonicity  of  the  operating  characteristic  of  the 
correct answer with many actual test items, as well as differential information provided by separate
distractors. Fourthly, it has been pointed out that the nonparametric approach to the estimation of the
operating characteristics of discrete responses has been successful enough to detect the non-monotonicity
of the function when it exists, and to approximate rather irregular curves fairly accurately.

With  all  these  facts,  it  is  time  to  reconsider  conventional  strategies  for  item  writing  and  to  propose 
new  strategies. 

The  first  thing  we  need  to  reconsider  is  the  lack  of  sufficient  interactions  between  theorists  and 
people  who  write  test  items.  It  has  been  fairly  common  that:  1)  a  committee  is  organized  for  writing 
test  items  in  a  specified  content  area  or  domain  and  eventually  produces  a  set  of  test  items;  2)  another 
group  of  people  tests  these  items  on  a  small  sample  of  subjects,  screens  the  items  and  then  administers 
the selected items to larger groups of subjects. Item calibration is done at the second stage, assuming
some model such as the three-parameter logistic model. In most cases, there is practically no
feedback  from  theorists  to  item  writers.  If  we  set  a  strategy  that  more  interactions  are  made  between 
the  two  groups  of  people  so  that  the  test  items  are  revised  and  pilot  tested  with  each  interaction,  we 
shall  be  able  to  improve  the  test,  and  the  improvement  will  lead  to  efficiency  in  ability  estimation. 

The  second  thing  we  need  to  reconsider  is  the  simpleminded  avoidance  of  non-monotonicity  of  the 
operating  characteristic  of  the  correct  answer.  While  it  is  not  desirable  for  an  item  to  have  higher 
conditional  probabilities  of  the  correct  answer  on  lower  levels  of  ability  than  on  higher  levels,  selecting 
alternative  answers  so  that  the  “dips”  of  the  operating  characteristic  of  the  correct  answer  be  smoothed 
out  will  lead  to  a  substantially  large  value  of  the  lower  asymptote  of  the  operating  characteristic  in  most 
cases.  We  must  recall  that  even  a  small  number  like  0.2  as  cg  in  the  three-parameter  logistic  model 
is  a  big  nuisance,  as  was  discussed  in  Section  3.  Our  strategy  must  be  that  we  make  the  best  use  of 
those  “dips”,  instead  of  avoiding  them. 

Figure 6-1 presents the operating characteristics of the five alternative answers of a hypothesized
test item following Model B (Samejima, 1979b), with the parameter values: ag = 1.50 , b1 = -2.00 ,
b2 = -1.00 , b3 = 0.00 , b4 = 1.00 and b5 = 2.00 . The subscript for each of the five difficulty
parameters indicates the order of easiness for the examinee to be attracted to the plausibility of each
alternative answer, so that, in this example, b5 indicates the difficulty parameter of the correct
answer. We can see in this figure that a practical monotonicity exists for the operating characteristic
of the correct answer for the range of θ , (-0.5, ∞) , and, more importantly, within this range of θ
its lower asymptote is very close to zero, i.e., the nuisance caused by the non-zero lower asymptote will
be gone as long as we administer the item to populations of subjects whose ability distributes on higher
levels than θ = -0.5 .

FIGURE 6-1

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
Following Model B, with the Parameter Values: ag = 1.5 , b1 = -2.0 , b2 = -1.0 ,
b3 = 0.0 , b4 = 1.0 and b5 = 2.0 .

FIGURE 6-2

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
in the Free-Response Situation Following the Logistic Model on the Graded Response
Level, with the Parameter Values: ag = 1.5 , b1 = -2.0 , b2 = -1.0 ,
b3 = 0.0 , b4 = 1.0 and b5 = 2.0 .

FIGURE 6-3

Operating Characteristics of the Correct Answer Obtained by the Five Different
Redichotomizations of the Graded Test Item Following the Logistic Model, with
the Discrimination Parameter, ag = 1.5 , and the Difficulty Parameters,
b1 = -2.0 , b2 = -1.0 , b3 = 0.0 , b4 = 1.0 and b5 = 2.0 ,
Respectively.

These operating characteristics of the five alternative answers in Figure 6-1 originate from those
in the logistic model on the graded response level (Samejima, 1969, 1972) with the same parameter values
(cf. Samejima, 1979b). Figure 6-2 presents the corresponding set of operating characteristics of the
five alternative answers in the logistic model. We notice there is an additional strictly decreasing curve in this
figure. This curve represents the conditional probability, given θ , that the examinee does not find
attractiveness in any alternative answer. In Model B, these people are assumed to guess randomly, so
in Figure 6-1 this curve does not exist, and the conditional probability is evenly distributed among the
five alternative answers to account for the rises in their operating characteristics at lower levels of θ .
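
The relation just described can be written out directly. The sketch below is an illustration under stated assumptions rather than the model's full specification: it takes the graded-response category probabilities as differences of adjacent cumulative logistic curves with scaling factor D = 1.7 (assumed), and obtains the Model B curves by sharing the "not attracted" probability equally among the five alternatives, as the paragraph above states. The function names are hypothetical.

```python
import math

D = 1.7  # assumed scaling factor


def cumulative(theta, a, b):
    """Logistic curve for one redichotomization of the graded item."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def graded_probs(theta, a, bs):
    """[P_0, P_1, ..., P_m]; P_0 is 'attracted to no alternative answer'."""
    stars = [1.0] + [cumulative(theta, a, b) for b in bs] + [0.0]
    return [stars[k] - stars[k + 1] for k in range(len(bs) + 1)]


def model_b_probs(theta, a, bs):
    """Share P_0 evenly among the m alternatives (random guessing)."""
    p = graded_probs(theta, a, bs)
    share = p[0] / len(bs)
    return [pk + share for pk in p[1:]]


a, bs = 1.5, [-2.0, -1.0, 0.0, 1.0, 2.0]  # values of Figure 6-1
for theta in (-3.0, -0.5, 0.0, 2.0):
    print(theta, [round(x, 4) for x in model_b_probs(theta, a, bs)])

# The last entry is the correct answer (difficulty b5).
correct_at_minus_half = model_b_probs(-0.5, a, bs)[-1]
```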

Figure 6-3 presents the operating characteristics of the correct answer following the logistic model
on the dichotomous response level, which are obtained by the five different redichotomizations of the
graded test item exemplified in Figure 6-2. In these functions, ag = 1.5 is the common discrimination
parameter, and the difficulty parameters are: bg = -2.0 , -1.0 , 0.0 , 1.0 , 2.0 , respectively. This is the
starting point of the graded response model, which leads to the operating characteristics illustrated in
Figure 6-2 (cf. Samejima, 1969, 1972).

Suppose that two alternative answers which attract examinees of low levels of θ are replaced, and
the revised item has b1 = -3.00 and b2 = -1.50 , respectively. In this situation, the operating
characteristics of the correct answer obtained by the first two redichotomizations are changed, as
shown in Figure 6-4. This revision leads to the set of operating characteristics in the logistic model
on the graded response level presented in Figure 6-5. If we compare this figure with Figure 6-2, we
can see that the curve for the category of not attracted examinees is shifted to substantially lower
levels of θ . Figure 6-6 presents the set of operating characteristics for this revised test item following
Model B. In this figure we can see that the operating characteristic of the correct answer is practically
strictly increasing within the range of θ , (-1.7, ∞) , and the pseudo lower asymptote of the operating
characteristic within this range of θ is still very close to zero.

A big gain resulting from this revision is the fact that the lower endpoint of the interval of θ in which
the operating characteristic of the correct answer is practically monotonic has substantially shifted in
the negative direction, while still keeping its lower asymptote practically zero. Thus we can avoid the
noise coming from the lower asymptote even if we administer the item to populations of examinees
whose ability distributions are located on lower levels of θ . In other words, without sacrificing the
accuracy of ability estimation, the utility of the item has been substantially enhanced by this revision.
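
The size of this gain can be illustrated with a small comparison of the original and the revised item. The sketch assumes the Model B construction described earlier (the "not attracted" probability shared equally among the five alternatives), a scaling factor D = 1.7, and a tolerance of 0.01 for what counts as a negligible guessing contribution; the tolerance and the function names are assumptions, and the endpoint obtained depends on the tolerance chosen.

```python
import math

D = 1.7  # assumed scaling factor


def cumulative(theta, a, b):
    """Logistic curve for one redichotomization of the graded item."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def correct_prob_model_b(theta, a, bs):
    """Correct answer = last category, plus an equal share of random guessing."""
    p_not_attracted = 1.0 - cumulative(theta, a, bs[0])
    return cumulative(theta, a, bs[-1]) + p_not_attracted / len(bs)


def guessing_share_endpoint(a, b_first, n_alternatives, tol):
    """theta at which the guessing share P_0(theta) / m has fallen to tol."""
    target = 1.0 - n_alternatives * tol  # required value of the first curve
    return b_first + math.log(target / (1.0 - target)) / (D * a)


a = 1.5
original = [-2.0, -1.0, 0.0, 1.0, 2.0]
revised = [-3.0, -1.5, 0.0, 1.0, 2.0]
tol = 0.01  # assumed tolerance

end_original = guessing_share_endpoint(a, original[0], len(original), tol)
end_revised = guessing_share_endpoint(a, revised[0], len(revised), tol)
print(round(end_original, 3), round(end_revised, 3))

# Probability of the correct answer at theta = -1.7 under each version.
print(round(correct_prob_model_b(-1.7, a, original), 4),
      round(correct_prob_model_b(-1.7, a, revised), 4))
```

Under these assumptions the endpoint moves by the same amount as the lowest difficulty parameter, which is the sense in which replacing the easiest distractor extends the usable range of the item.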

The  above  example  suggests  the  following  strategy. 

(1)  If  the  nonparametrically  estimated  operating  characteristic  of  the  correct  answer  to 
an item provides us with a relatively high value of θ below which monotonicity does
not  exist,  then  change  the  set  of  distractors  to  include  one  or  more  wrong  answers 
that  attract  examinees  of  very  low  levels  of  ability. 

It  may  sound  difficult  to  do  in  practice.  If  we  pay  attention  to  actually  used  multiple-choice  test 
items,  however,  we  will  come  across  many  wrong  alternative  answers  that  are  attracting  examinees  of 
very  low  levels  of  ability.  To  give  an  example,  the  author  has  come  across  an  arithmetic  item  asking  for 
the  area  of  a  rectangle.  A  substantial  number  of  seventh  graders  chose  the  wrong  alternative  answer 
which  equals  the  sum  of  the  two  sides  of  the  rectangle  of  different  lengths!  It  is  obvious  that  those  who 
did  not  understand  how  to  obtain  the  area  of  a  rectangle  at  all  chose  this  alternative  answer. 

Another consideration which is important in writing test items is to keep the “pseudo” lower
asymptote of the operating characteristic of the correct answer close enough to zero, as is the case with the
above example. This has a great deal to do with the discrimination powers of the alternative answers,
as well as the configuration of the plausibility functions. Figures 6-7 through 6-9 present, in the
reversed order, the same set of three figures as Figures 6-1 through 6-3, by changing the discrimination

FIGURE 6-4

Operating Characteristics of the Correct Answer Obtained by the Five Different
Redichotomizations of the Graded Test Item Following the Logistic Model, with
the Discrimination Parameter, $a_g = 1.5$, and the Difficulty Parameters,
$b_1 = -3.0$, $b_2 = -1.5$, $b_3 = 0.0$, $b_4 = 1.0$ and $b_5 = 2.0$,
Respectively.

FIGURE 6-5

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
in the Free-Response Situation Following the Logistic Model on the Graded Response
Level, with the Parameter Values: $a_g = 1.5$, $b_1 = -3.0$, $b_2 = -1.5$,
$b_3 = 0.0$, $b_4 = 1.0$ and $b_5 = 2.0$.

FIGURE 6-6

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
Following Model B, with the Parameter Values: $a_g = 1.5$, $b_1 = -3.0$, $b_2 = -1.5$,
$b_3 = 0.0$, $b_4 = 1.0$ and $b_5 = 2.0$.

FIGURE 6-7

Operating Characteristics of the Correct Answer Obtained by the Five Different
Redichotomizations of the Graded Test Item Following the Logistic Model, with
the Discrimination Parameter, $a_g = 1.0$, and the Difficulty Parameters,
$b_1 = -2.0$, $b_2 = -1.0$, $b_3 = 0.0$, $b_4 = 1.0$ and $b_5 = 2.0$,
Respectively.

FIGURE 6-8

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
in the Free-Response Situation Following the Logistic Model on the Graded Response
Level, with the Parameter Values: $a_g = 1.0$, $b_1 = -2.0$, $b_2 = -1.0$,
$b_3 = 0.0$, $b_4 = 1.0$ and $b_5 = 2.0$.

FIGURE 6-9

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
Following Model B, with the Parameter Values: $a_g = 1.0$, $b_1 = -2.0$, $b_2 = -1.0$,
$b_3 = 0.0$, $b_4 = 1.0$ and $b_5 = 2.0$.

parameter from $a_g = 1.5$ to $a_g = 1.0$, while keeping the five difficulty parameters unchanged. If
we compare Figure 6-9 with Figure 6-1, we can see a substantial enhancement of the “pseudo” lower
asymptote within the interval of $\theta$, $(-0.5, \infty)$, i.e., the nuisance has been increased by the change of
the discrimination parameter.
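
The flattening effect of a smaller discrimination parameter can be checked numerically on the graded response level. The sketch below assumes the usual logistic form for each redichotomization and a scaling constant D = 1.7, neither of which is spelled out in this section; the function names are hypothetical. It does not reproduce Model B, and it yields six graded categories from the five difficulty parameters rather than the five alternative answers of the figures.

```python
import math

D = 1.7  # scaling constant; an assumption, not stated in the text


def cumulative(theta, a, b):
    """Logistic curve of one redichotomization with difficulty b."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def category_curves(theta, a, bs):
    """Operating characteristics of graded categories 0..len(bs) at theta.

    Each category is the difference of two adjacent cumulative curves,
    with the outermost curves fixed at 1 and 0.
    """
    stars = [1.0] + [cumulative(theta, a, b) for b in bs] + [0.0]
    return [stars[k] - stars[k + 1] for k in range(len(bs) + 1)]


# Difficulty parameters as in the captions of Figures 6-7 through 6-9.
bs = [-2.0, -1.0, 0.0, 1.0, 2.0]
for a in (1.5, 1.0):
    row = [round(p, 3) for p in category_curves(0.5, a, bs)]
    print("a_g =", a, row)
```

Under these assumptions, for any $\theta$ below a given difficulty parameter the cumulative curve is higher with $a_g = 1.0$ than with $a_g = 1.5$, so the flatter curves leave more probability on the upper categories at lower ability levels.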

This  suggests  the  second  strategy: 

(2)  If possible, try to include distractors whose estimated operating characteristics are
steep, while keeping the differential configuration of these functions, as is illustrated
in Figure 6-4.

So far our strategies have focused upon producing an informative operating characteristic of
the correct answer. We notice, however, that these strategies will also give us distractors which
themselves provide differential information. This implies that approximating the nonparametrically
estimated operating characteristics of one or more alternative answers by some mathematical formulae
will enable us to use this additional differential information in ability estimation. This posterior
parameterization of the nonparametrically estimated operating characteristics of distractors will certainly
lead to increased accuracy and efficiency in ability measurement.
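
One way such a posterior parameterization might look is sketched below. It assumes, purely for illustration, that a distractor's operating characteristic is approximated by the difference of two logistic curves, and it fits the three parameters by a least-squares grid search. The formula, the scaling constant D = 1.7, the function names and the synthetic data are all assumptions rather than the procedure of this paper.

```python
import math

D = 1.7  # scaling constant; an assumption


def logistic(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def distractor_curve(theta, a, b_lo, b_hi):
    """Unimodal curve: difference of two logistic curves (b_lo < b_hi)."""
    return logistic(theta, a, b_lo) - logistic(theta, a, b_hi)


def fit_distractor(thetas, probs, a_grid, b_grid):
    """Least-squares grid search; returns the best (a, b_lo, b_hi)."""
    best, best_sse = None, float("inf")
    for a in a_grid:
        for b_lo in b_grid:
            for b_hi in b_grid:
                if b_lo >= b_hi:
                    continue
                sse = sum((distractor_curve(t, a, b_lo, b_hi) - p) ** 2
                          for t, p in zip(thetas, probs))
                if sse < best_sse:
                    best, best_sse = (a, b_lo, b_hi), sse
    return best


# Synthetic "estimated" curve generated from known parameters.
thetas = [x / 2 for x in range(-8, 9)]
probs = [distractor_curve(t, 1.5, -1.5, 0.0) for t in thetas]
a_grid = [0.5, 1.0, 1.5, 2.0]
b_grid = [x / 2 for x in range(-6, 7)]
print(fit_distractor(thetas, probs, a_grid, b_grid))
```

A grid search is used only to keep the sketch free of dependencies; with real estimates a proper optimizer and a check of the fit would be needed.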


VII  Discussion  and  Conclusions 

The present paper summarizes the shortcomings of the conventional way of handling the multiple-choice
test item and describes theories and methodologies that can be applied to handle it
better; some empirical facts are introduced to support the theoretical observations;
finally, new strategies of item writing are proposed which will reduce noise and lead to more efficient
ability estimation.

In spite of many criticisms of the multiple-choice test, it has been, and still is, very popular
among those engaged in psychological and educational measurement because of its economy in scoring.
Fortunately, theorists in mathematical psychology have developed many new ideas and methodologies in
the past couple of decades that can improve the way the multiple-choice test is handled. The nonparametric
approach to estimating the operating characteristic is one of them. Also, the rapid progress in electronic
technologies has made it possible to put these theories and methodologies to use in practical
situations. Today, we are in a position to take advantage of all these accomplishments.